Final Project
Predictive Factors of Age on Purchasing
1. Introduction
There has been much talk lately about how the generations differ in behavior, especially as Gen Z is about to enter the adult stage of life. Much of this dialogue happens online. For example, the catchphrase "ok, boomer" gained significant attention, comparing Millennials and Gen Z is a common trend, and Generation Alpha has entered its school-age years, which has led to online discourse about a lack of social skills due to technology. These differences in general behavior have many implications for marketers. This study aims to better understand how age influences purchase behavior.
The Baby Boomer generation is made up of those born between 1945 and 1963, which includes people from age 60 to age 78. The next generation, Generation X, is made up of people born between 1964 and 1978, which includes people from age 45 to age 59. Generation X is followed by Millennials, who were born between 1979 and 1993 and are between the ages of 30 and 44. The last generation relevant to this study is Generation Z, born between 1994 and 2011, whose members are now between the ages of 12 and 29.
To better understand how age influences shopping behavior, I used a data set from Kaggle that compiled information on consumers and "includes demographic information, purchase history, product preferences, and preferred shopping channels (online or offline)" (Kaggle). The data was last updated in October 2023. Each row of data represents a different individual consumer.
Here is a snapshot of 5 randomly chosen rows of the data set we'll use:
# A tibble: 5 × 22
customer_ID age gender item_purchased category purchase_amount_USD location
<dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 3633 27 Female Sneakers Footwear 73 Utah
2 3127 57 Female Sunglasses Accessor… 76 Maine
3 999 51 Male Shoes Footwear 90 Connect…
4 417 36 Male Belt Accessor… 55 Oregon
5 3410 24 Female Shirt Clothing 93 Minneso…
# ℹ 15 more variables: size <chr>, color <chr>, season <chr>,
# review_rating <dbl>, subscription_status <chr>, shipping_type <chr>,
# discount_applied <chr>, promo_code_used <chr>, previous_purchases <dbl>,
# payment_method <chr>, forequency_of_purchases <chr>, age_group <fct>,
# generation <fct>, numeric_age_group <dbl>, numeric_generation <dbl>
2. Exploratory Data Analysis
We had an original sample size of 3,900. None of the participants in the sample had missing values, so our total sample size remained 3,900.
Table 1. Summary Statistics by generation of number of participants, mean and standard deviation for previous_purchases.
# A tibble: 4 × 4
generation count mean_previous_purchases sd_previous_purchases
<fct> <dbl> <dbl> <dbl>
1 Baby Boomers 863 25.9 14.3
2 Generation X 1118 25.6 14.5
3 Generation Z 802 24.6 14.3
4 Millenials 1117 25.2 14.6
Table 2. Summary Statistics by generation of mean and standard deviation of review rating and mean and standard deviation amount purchased in USD.
# A tibble: 4 × 5
generation mean_review sd_review mean_purchase_amount sd_purchase_amount
<fct> <dbl> <dbl> <dbl> <dbl>
1 Baby Boomers 3.75 0.710 59.4 23.9
2 Generation X 3.72 0.721 60.0 23.8
3 Generation Z 3.79 0.715 60.4 23.9
4 Millenials 3.75 0.716 59.4 23.3
Table 3. Summary Statistics by age group of number of participants and mean and standard deviation of previous purchases.
# A tibble: 10 × 4
age_group count mean_previous_purchases sd_previous_purchases
<fct> <dbl> <dbl> <dbl>
1 18-24 418 24.1 14.1
2 25-29 384 25.1 14.5
3 30-34 371 25.0 14.6
4 35-39 361 25.1 14.8
5 40-44 385 25.5 14.6
6 45-49 338 23.6 14.1
7 50-54 382 26.3 14.6
8 55-59 398 26.5 14.5
9 60-64 363 25.2 14.2
10 65-70 500 26.4 14.3
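As a consistency check on Tables 1 and 3, the age-group counts can be rolled up into generation counts using the age ranges given in the introduction. The sketch below uses plain Python for illustration only; the helper names `generation_for_age` and `roll_up_counts` are hypothetical, and the counts are copied from Table 3.

```python
# Age ranges (as of the 2023 data update) taken from the introduction.
# The spelling "Millenials" matches the factor level shown in Table 1.
GENERATION_AGE_RANGES = {
    "Generation Z": (12, 29),
    "Millenials": (30, 44),
    "Generation X": (45, 59),
    "Baby Boomers": (60, 78),
}

# Participant counts per age group, copied from Table 3.
AGE_GROUP_COUNTS = {
    "18-24": 418, "25-29": 384, "30-34": 371, "35-39": 361, "40-44": 385,
    "45-49": 338, "50-54": 382, "55-59": 398, "60-64": 363, "65-70": 500,
}


def generation_for_age(age):
    """Return the generation whose age range contains `age` (hypothetical helper)."""
    for name, (low, high) in GENERATION_AGE_RANGES.items():
        if low <= age <= high:
            return name
    raise ValueError(f"age {age} is outside the ranges defined in the introduction")


def roll_up_counts(age_group_counts):
    """Sum age-group counts into generation counts using each group's lower bound."""
    totals = {name: 0 for name in GENERATION_AGE_RANGES}
    for label, count in age_group_counts.items():
        lower_bound = int(label.split("-")[0])
        totals[generation_for_age(lower_bound)] += count
    return totals


generation_counts = roll_up_counts(AGE_GROUP_COUNTS)
print(generation_counts)
print(sum(generation_counts.values()))
```

The rolled-up totals agree with the count column of Table 1 (802, 1117, 1118, and 863) and with the overall sample size of 3,900.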
Table 4. Summary Statistics by age group of mean and standard deviation of review rating and mean and standard deviation of amount purchased in USD.
# A tibble: 10 × 5
age_group mean_review sd_review mean_purchase_amount sd_purchase_amount
<fct> <dbl> <dbl> <dbl> <dbl>
1 18-24 3.81 0.730 59.7 23.7
2 25-29 3.77 0.699 61.0 24.1
3 30-34 3.76 0.717 60.6 23.2
4 35-39 3.73 0.698 59.5 23.5
5 40-44 3.76 0.734 58.2 23.0
6 45-49 3.71 0.714 57.1 23.6
7 50-54 3.72 0.742 63.1 23.9
8 55-59 3.72 0.708 59.4 23.6
9 60-64 3.73 0.727 59.4 23.4
10 65-70 3.77 0.698 59.3 24.3
The generation with the greatest mean number of previous purchases was the Baby Boomers (n = 863, mean = 25.93, median = 26, sd = 14.26). When split by age group, however, the group with the greatest mean number of previous purchases was those aged 55 to 59 (n = 398, mean = 26.5, sd = 14.5 in Table 3), who are members of Generation X. The group with the second greatest mean number of previous purchases was those aged 65 to 70 (n = 500, mean = 26.42, median = 27, sd = 14.29), who are indeed a part of the Baby Boomer generation.
The generation that spent the greatest average amount of money in USD on their purchase was Generation Z (n = 802, mean = 60.36, median = 61, sd = 23.88). However, the age group that spent the greatest average amount was those aged 50 to 54 (n = 382, mean = 63.06, median = 64, sd = 23.90), who belong to Generation X, followed by the Generation Z age group between the ages of 25 and 29 (n = 384, mean = 61.04, median = 61.5, sd = 24.07).
Finally, the generation with the highest average review rating was Generation Z (n = 802, mean = 3.79, median = 3.8, sd = 0.71), followed by Millennials (n = 1117, mean = 3.75, median = 3.8, sd = 0.72), who tie with the Baby Boomers at 3.75 in Table 2. The age group with the highest average review rating was ages 18 to 24 (n = 418, mean = 3.81, median = 3.9, sd = 0.73), who belong to Generation Z. They were followed by those aged 25 to 29 (n = 384, mean = 3.77, median = 3.8, sd = 0.70), the other Generation Z age group.
Exploratory analysis leads us to ask if there are any other variables across age that influence purchase behaviors.
3. Multiple Linear Regression
3.1.1. Model 1 Methods
The components of my multiple linear regression model are the following:
Outcome variable y1 = purchase amount in USD
Numerical explanatory variable x1 = age
Categorical explanatory variable x2 = frequency of purchases
We want to know if the relationship between age and purchase_amount_USD is conditional on one’s frequency of purchase.
Table 5. Regression table for interaction model of amount purchased in USD as a function of age and frequency of purchases.
# A tibble: 14 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 65.7 3.03 21.7 7.30e-99 5.98e+1 71.7
2 age -0.125 0.0640 -1.95 5.13e- 2 -2.50e-1 0.000680
3 forequency_of_purch… -8.87 4.29 -2.06 3.91e- 2 -1.73e+1 -0.445
4 forequency_of_purch… -4.78 4.22 -1.13 2.57e- 1 -1.31e+1 3.49
5 forequency_of_purch… -5.24 4.33 -1.21 2.26e- 1 -1.37e+1 3.24
6 forequency_of_purch… -8.31 4.38 -1.90 5.81e- 2 -1.69e+1 0.285
7 forequency_of_purch… -8.24 4.30 -1.92 5.53e- 2 -1.67e+1 0.188
8 forequency_of_purch… -1.51 4.43 -0.342 7.33e- 1 -1.02e+1 7.18
9 age:forequency_of_p… 0.213 0.0923 2.31 2.11e- 2 3.20e-2 0.394
10 age:forequency_of_p… 0.105 0.0905 1.16 2.48e- 1 -7.29e-2 0.282
11 age:forequency_of_p… 0.0915 0.0927 0.987 3.24e- 1 -9.03e-2 0.273
12 age:forequency_of_p… 0.167 0.0933 1.80 7.27e- 2 -1.54e-2 0.350
13 age:forequency_of_p… 0.180 0.0910 1.98 4.76e- 2 1.94e-3 0.359
14 age:forequency_of_p… 0.00698 0.0940 0.0742 9.41e- 1 -1.77e-1 0.191
3.1.2. Model 1 Results
Since “annually” comes first alphabetically, people who shop annually are the “baseline comparison group”. Therefore, the intercept (b0 = 65.75) represents the intercept for only the annual group.
The estimate for the slope for age (bage = -0.12) is the associated change in purchase amount for every increase of one year in age, again for the annual group only. For every one-year increase in age, there is an associated decrease of 0.12 USD in amount purchased.
The estimates for the other purchasing frequencies are the offsets in intercept relative to the annual group (baseline). Similarly, the estimates for the age x frequency terms are the offsets in slope for age relative to the annual group.
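To make the offsets concrete, the group-specific intercepts and slopes can be computed by adding each offset to the baseline estimate. The sketch below does this arithmetic in plain Python with the rounded estimates from Table 5. Because the term names in the table are truncated, groups are identified by table row rather than by name, and pairing row k with row k + 6 assumes the interaction terms are listed in the same order as the intercept offsets. The function name `group_lines` is hypothetical.

```python
# Rounded estimates copied from Table 5 (the annual group is the baseline).
BASELINE_INTERCEPT = 65.7
BASELINE_AGE_SLOPE = -0.125

# Intercept offsets, rows 3-8 of Table 5, keyed by row number.
INTERCEPT_OFFSETS = {3: -8.87, 4: -4.78, 5: -5.24, 6: -8.31, 7: -8.24, 8: -1.51}
# Slope offsets (age x frequency terms), rows 9-14 of Table 5.
SLOPE_OFFSETS = {9: 0.213, 10: 0.105, 11: 0.0915, 12: 0.167, 13: 0.180, 14: 0.00698}


def group_lines():
    """Return {intercept_row: (intercept, age_slope)} for each non-baseline group.

    Assumes row k (intercept offset) and row k + 6 (slope offset) describe the
    same purchase-frequency group.
    """
    lines = {}
    for row, intercept_offset in INTERCEPT_OFFSETS.items():
        slope_offset = SLOPE_OFFSETS[row + 6]
        lines[row] = (
            BASELINE_INTERCEPT + intercept_offset,
            BASELINE_AGE_SLOPE + slope_offset,
        )
    return lines


for row, (intercept, slope) in group_lines().items():
    print(f"row {row}: intercept = {intercept:.2f}, age slope = {slope:+.4f}")
```

With these rounded values, the age slope is positive for the groups in rows 3, 6, and 7 and negative for the baseline group and the groups in rows 4, 5, and 8.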
3.1.3. Model 1 Interpretation
Using the output of our regression table we'll test two different null hypotheses. The first null hypothesis is that there is no relationship between age and amount purchased in USD at the population level (the population slope is zero).
There appears to be a possible negative relationship between age and amount purchased in USD for consumers (Bage = -0.12). However, this does not appear to be a meaningful relationship since, in the table, we see:
the 95% confidence interval for the population slope Bage is (-0.250, 0.00068), and zero is included within this interval
the p-value (p = 0.051) is slightly greater than 0.05 but less than 0.1, so there is only weak evidence against the null hypothesis
The null hypothesis cannot be confidently rejected
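The test statistic and confidence interval for the age slope can be approximately reproduced from the estimate and standard error in Table 5. The sketch below assumes a normal critical value of 1.96, which is a close stand-in for the t critical value given the large sample (n = 3,900); small differences from the table come from that approximation and from rounding of the inputs. The function name `wald_summary` is hypothetical.

```python
# Rounded values for the age term, copied from row 2 of Table 5.
ESTIMATE = -0.125
STD_ERROR = 0.0640
CRITICAL_VALUE = 1.96  # assumption: normal approximation to the t distribution


def wald_summary(estimate, std_error, critical_value=CRITICAL_VALUE):
    """Return (test statistic, CI lower bound, CI upper bound)."""
    statistic = estimate / std_error
    margin = critical_value * std_error
    return statistic, estimate - margin, estimate + margin


statistic, ci_low, ci_high = wald_summary(ESTIMATE, STD_ERROR)
print(f"statistic = {statistic:.2f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
print("interval contains zero:", ci_low <= 0 <= ci_high)
```

The upper bound sits only just above zero, which is why the evidence against the null hypothesis is described as weak rather than absent.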
The second set of null hypotheses that we are testing is that all the differences in intercept for the non-baseline groups are zero.
Of the six 95% confidence intervals for the population differences in intercept (rows 3 to 8 of Table 5), only the one for Bbiweekly (-17.29, -0.45) does not include 0. So it is plausible that the differences in intercept for all groups except Bbiweekly are zero.
Likewise, the p-values for five of the six differences in intercept are too large to reject the null hypothesis; the exception is Bbiweekly (p = 0.039). Among the differences in slope (rows 9 to 14), two have intervals that exclude 0: row 9 (0.032, 0.39), p = 0.021, and row 13 (0.0019, 0.36), p = 0.048.
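Which terms in Table 5 have a 95% confidence interval that excludes zero can be checked mechanically. The sketch below scans the interval bounds copied from the table, keyed by row number because the term names are truncated; the function name `rows_excluding_zero` is hypothetical.

```python
# (conf.low, conf.high) for rows 2-14 of Table 5, keyed by row number.
TABLE_5_INTERVALS = {
    2: (-0.250, 0.000680),
    3: (-17.3, -0.445),
    4: (-13.1, 3.49),
    5: (-13.7, 3.24),
    6: (-16.9, 0.285),
    7: (-16.7, 0.188),
    8: (-10.2, 7.18),
    9: (0.0320, 0.394),
    10: (-0.0729, 0.282),
    11: (-0.0903, 0.273),
    12: (-0.0154, 0.350),
    13: (0.00194, 0.359),
    14: (-0.177, 0.191),
}


def rows_excluding_zero(intervals):
    """Return the row numbers whose interval lies entirely above or below zero."""
    return [row for row, (low, high) in intervals.items() if low > 0 or high < 0]


print(rows_excluding_zero(TABLE_5_INTERVALS))  # → [3, 9, 13]
```

Row 3 is an intercept offset, while rows 9 and 13 are age x frequency slope offsets.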
3.2.1. Model 2 Methods
The components of my multiple linear regression model are the following:
Outcome variable y1 = purchase amount in USD
Numerical explanatory variable x1 = previous purchases
Categorical explanatory variable x2 = age group
We want to know if the relationship between amount of previous purchases and purchase_amount_USD is conditional on one’s age group.
Table 6. Regression table for interaction model of amount purchased in USD as a function of age group and previous purchases.
# A tibble: 20 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 60.3 2.29 26.4 4.91e-141 55.9 64.8
2 previous_purchases -0.0253 0.0820 -0.309 7.57e- 1 -0.186 0.135
3 age_group25-29 -1.35 3.33 -0.405 6.85e- 1 -7.87 5.17
4 age_group30-34 0.273 3.35 0.0817 9.35e- 1 -6.29 6.83
5 age_group35-39 -0.746 3.36 -0.222 8.25e- 1 -7.34 5.85
6 age_group40-44 -2.10 3.34 -0.630 5.29e- 1 -8.65 4.44
7 age_group45-49 -4.69 3.40 -1.38 1.68e- 1 -11.4 1.98
8 age_group50-54 6.35 3.39 1.87 6.12e- 2 -0.297 13.0
9 age_group55-59 -1.58 3.37 -0.468 6.40e- 1 -8.18 5.03
10 age_group60-64 -1.26 3.42 -0.370 7.12e- 1 -7.96 5.44
11 age_group65-70 -3.19 3.19 -1.00 3.17e- 1 -9.46 3.07
12 previous_purchases… 0.107 0.117 0.915 3.60e- 1 -0.122 0.336
13 previous_purchases… 0.0259 0.118 0.221 8.25e- 1 -0.205 0.256
14 previous_purchases… 0.0228 0.118 0.193 8.47e- 1 -0.208 0.254
15 previous_purchases… 0.0253 0.116 0.217 8.28e- 1 -0.203 0.254
16 previous_purchases… 0.0880 0.123 0.716 4.74e- 1 -0.153 0.329
17 previous_purchases… -0.113 0.117 -0.964 3.35e- 1 -0.342 0.116
18 previous_purchases… 0.0482 0.116 0.417 6.77e- 1 -0.179 0.275
19 previous_purchases… 0.0399 0.120 0.332 7.40e- 1 -0.195 0.275
20 previous_purchases… 0.108 0.111 0.976 3.29e- 1 -0.109 0.325
3.2.2. Model 2 Results
First, since 18-24 comes numerically before the other age groups, the 18-24 age group is the "baseline for comparison" group. Thus, the intercept estimate is the intercept for the 18-24 group.
This holds similarly for previous_purchases: its estimate is the slope for previous_purchases for only the 18-24 group. Thus, the regression line for the 18-24 group has an intercept of 60.34 and a slope for previous_purchases of -0.025.
The values for the other age groups are not their intercepts, but rather the offsets in intercept for each age group relative to the 18-24 age group. The intercept for each of the other age groups is the baseline intercept + the estimate for that age group.
Similarly, the age group x previous_purchases estimates are not the slopes for the other age groups, but rather the offsets in slope for those age groups. Therefore, the slope for each of the other age groups is the previous_purchases estimate + that group's age group x previous_purchases estimate.
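As a worked example of combining these offsets, the fitted purchase amount for a given number of previous purchases can be computed for the baseline group and for one other age group. The sketch below uses rounded estimates from Table 6 and, because the interaction term names are truncated, assumes that row 17 is the slope offset for the 50-54 group (that is, that rows 12 to 20 follow the same order as rows 3 to 11). The function name `predict_purchase_amount` is hypothetical.

```python
# Rounded estimates copied from Table 6 (18-24 is the baseline group).
BASELINE_INTERCEPT = 60.3
BASELINE_SLOPE = -0.0253  # slope for previous_purchases in the 18-24 group

# Offsets for the 50-54 group: row 8 (intercept) and, by assumption, row 17 (slope).
INTERCEPT_OFFSET_50_54 = 6.35
SLOPE_OFFSET_50_54 = -0.113


def predict_purchase_amount(previous_purchases, intercept_offset=0.0, slope_offset=0.0):
    """Fitted purchase amount in USD; offsets of zero give the baseline group."""
    intercept = BASELINE_INTERCEPT + intercept_offset
    slope = BASELINE_SLOPE + slope_offset
    return intercept + slope * previous_purchases


baseline_fit = predict_purchase_amount(25)
group_fit = predict_purchase_amount(25, INTERCEPT_OFFSET_50_54, SLOPE_OFFSET_50_54)
print(f"18-24 with 25 previous purchases: {baseline_fit:.2f} USD")
print(f"50-54 with 25 previous purchases: {group_fit:.2f} USD")
```

These are point estimates only; the wide confidence intervals in Table 6 mean the difference between the two fitted values should not be read as a reliable effect.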
3.2.3. Model 2 Interpretation
The first null hypothesis is that there is no relationship between previous purchases and amount purchased in USD at the population level (the population slope is zero).
There appears to be a possible negative relationship between previous purchases and amount purchased in USD for consumers (Bpreviouspurchases = -0.025). However, this does not appear to be a meaningful relationship since, in the table, we see:
the 95% confidence interval for the population slope Bpreviouspurchases is (-0.19, 0.14), and zero is included within this interval
The p-value (p = 0.76) is much greater than 0.1, so there is no evidence against the null hypothesis
The null hypothesis cannot be rejected.
The second set of null hypotheses that we are testing is that all the differences in intercept for the non-baseline groups are zero.
All of the 95% confidence intervals contain zero, therefore it is plausible that all intercepts are the same.
All of the p-values are too large to reject the null hypothesis.
3.3.1. Model 3 Methods
The components of my multiple linear regression model are the following:
Outcome variable y1 = review rating
Numerical explanatory variable x1 = age
Categorical explanatory variable x2 = discount applied
We want to know if the relationship between age and review rating is conditional on whether or not there was a discount applied to the purchase.
Table 7. Regression table for interaction model of review rating as a function of age and whether or not there was a discount applied.
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.86 0.0468 82.6 0 3.77e+0 3.95
2 age -0.00237 0.00100 -2.36 0.0185 -4.34e-3 -0.000397
3 discount_appliedYes -0.153 0.0709 -2.15 0.0313 -2.92e-1 -0.0137
4 age:discount_appliedY… 0.00306 0.00152 2.01 0.0443 7.74e-5 0.00604
3.3.2. Model 3 Results
Since “no” comes first alphabetically, people who did not have a discount applied are the “baseline comparison group”. Therefore, the intercept (b0 = 3.86) represents the intercept for only the group who did not receive a discount.
The estimate for the slope for age (bage = -0.0024) is the associated change in review rating for every increase of one year in age, for the group who did not receive a discount. For every one-year increase in age, there is an associated 0.0024 decrease in review rating.
The estimate for the group that did get a discount applied (Bdiscountappliedyes = -0.15) is the offset in intercept relative to the group who did not get a discount (baseline).
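The two fitted lines implied by Table 7 can be written out explicitly by adding the offsets to the baseline estimates. The sketch below uses the rounded table values; the crossover age it computes is simply where the two fitted lines meet and is not a result reported in the analysis. The function name `fitted_rating` is hypothetical.

```python
# Rounded estimates copied from Table 7 (no discount is the baseline group).
INTERCEPT_NO_DISCOUNT = 3.86
SLOPE_NO_DISCOUNT = -0.00237
INTERCEPT_OFFSET_DISCOUNT = -0.153
SLOPE_OFFSET_DISCOUNT = 0.00306

# Intercept and slope for the discount group: baseline plus offset.
INTERCEPT_DISCOUNT = INTERCEPT_NO_DISCOUNT + INTERCEPT_OFFSET_DISCOUNT
SLOPE_DISCOUNT = SLOPE_NO_DISCOUNT + SLOPE_OFFSET_DISCOUNT


def fitted_rating(age, discount_applied):
    """Fitted review rating for a given age and discount status."""
    if discount_applied:
        return INTERCEPT_DISCOUNT + SLOPE_DISCOUNT * age
    return INTERCEPT_NO_DISCOUNT + SLOPE_NO_DISCOUNT * age


# The lines meet where the intercept gap equals the accumulated slope gap.
crossover_age = -INTERCEPT_OFFSET_DISCOUNT / SLOPE_OFFSET_DISCOUNT

print(f"discount group: rating = {INTERCEPT_DISCOUNT:.3f} + {SLOPE_DISCOUNT:.5f} * age")
print(f"fitted lines cross at age {crossover_age:.1f}")
```

With these values the discount group's slope is slightly positive (about 0.0007 per year), and the fitted lines cross near age 50, so fitted ratings are lower with a discount below that age and higher above it.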
3.3.3. Model 3 Interpretation
The first null hypothesis is that there is no relationship between age and review rating at the population level (the population slope is zero).
There appears to be a possible negative relationship between age and review rating for consumers (Bage = -0.0024). This appears to be a meaningful relationship since, in the table, we see:
the 95% confidence interval for the population slope Bage is (-0.00434, -0.00040), and zero is not included within this interval
The p-value (p = 0.019) is less than 0.05, indicating moderate evidence against the null hypothesis
Therefore, the relationship does indeed appear to be negative.
The second null hypothesis that we are testing is that the difference in intercept for the non-baseline group is zero.
The 95% confidence interval for the group who received a discount is (-0.29, -0.014). This interval does not contain zero, therefore it is not plausible that the intercepts are the same.
The p-value is 0.031, which is less than 0.05, indicating moderate evidence against the null hypothesis.
Because the model includes an interaction, we must also address a third null hypothesis: that the difference in slope for age between the discount and no-discount groups is zero (there is no interaction between age and discount applied).
The 95% confidence interval for the interaction is (0.000077, 0.0060). This interval does not contain zero, therefore it is not plausible that the difference in slope is zero.
The p-value is 0.044, which is less than 0.05, indicating that there is moderate evidence against the null hypothesis.
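The evidence labels used throughout this section follow p-value cutoffs of 0.05 and 0.1. The sketch below encodes that convention and applies it to the three Model 3 terms; the function name `evidence_against_null` is hypothetical.

```python
def evidence_against_null(p_value):
    """Label a p-value using the cutoffs applied in this report (0.05 and 0.1)."""
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"


# p-values copied from Table 7 for the three tested terms.
MODEL_3_P_VALUES = {
    "age": 0.0185,
    "discount_appliedYes": 0.0313,
    "age x discount applied": 0.0443,
}

model_3_labels = {term: evidence_against_null(p) for term, p in MODEL_3_P_VALUES.items()}
print(model_3_labels)
```

All three Model 3 terms fall below 0.05, whereas the age slope in Model 1 (p = 0.051) falls in the weak band and the previous_purchases slope in Model 2 (p = 0.76) shows no evidence against the null.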
4. Conclusions
We found that (1) there was no significant difference in the amount purchased in USD based on age for people with different frequencies of purchases, (2) there was no significant difference in the amount purchased based on number of previous purchases for people in different age groups, and (3) there is moderate evidence for a difference in review rating based on age for people who got a discount versus those who did not. For Model 3, we found that when there was no discount applied, as age increased, review rating decreased. When there was a discount applied, as age increased, review rating increased. For Model 1, we found moderate evidence that as age increased, purchase amount increased for those who purchased biweekly and quarterly.
# A tibble: 12 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 58.4 2.86 20.4 3.32e-88
2 age 0.0308 0.0614 0.502 6.16e- 1
3 payment_methodCash 0.803 3.97 0.202 8.40e- 1
4 payment_methodCredit Card 0.386 4.00 0.0965 9.23e- 1
5 payment_methodDebit Card 0.789 4.08 0.193 8.47e- 1
6 payment_methodPayPal 6.66 4.07 1.64 1.02e- 1
7 payment_methodVenmo 4.61 4.06 1.13 2.57e- 1
8 age:payment_methodCash -0.0185 0.0851 -0.218 8.28e- 1
9 age:payment_methodCredit Card -0.00106 0.0852 -0.0125 9.90e- 1
10 age:payment_methodDebit Card 0.00925 0.0876 0.106 9.16e- 1
11 age:payment_methodPayPal -0.162 0.0875 -1.85 6.44e- 2
12 age:payment_methodVenmo -0.123 0.0876 -1.40 1.61e- 1